Abstract

Advancements in beamforming algorithms are exceeding the computation
and communication capabilities of traditional sonar array systems.
Implementing parallel beamforming algorithms in situ on distributed
array systems holds the potential to provide increased performance and
fault tolerance at a lower cost. This paper compares three parallel
algorithms for distributed arrays in terms of execution throughput,
result latency, scaled speedup, and parallel efficiency.

1. Introduction

Passive sonar beamforming is a class of array processing that optimizes
an array gain in a direction of interest to detect and locate objects in
an undersea environment. Beamforming algorithms are particularly vital
in radar and sonar applications. The parallel algorithms considered here
are designed for a distributed array of sonar transducer nodes, each
with its own processing element and interconnected by a network. The
necessity for parallel beamforming algorithms is a direct result of the
development of advanced beamforming algorithms that are better able to
cope with quiet sources and cluttered environments. These developments
have resulted in increased demands for real-time computation. Moreover,
the use of larger sonar arrays has in turn led to larger problem sizes.
Thus, a beamformer based on a centralized processing system may prove
insufficient to meet these demands, and parallel processing in situ on a
distributed array is a promising alternative.

2. Overview of Beamforming Algorithms

The basic operation in most beamforming algorithms is to sum the
manipulated outputs from many spatially separated sensors. The three
parallel beamforming algorithms discussed in this paper are based on
Conventional Beamforming (CBF), Split-Aperture Conventional Beamforming
(SA-CBF) [1], and an Adaptive Beamforming (ABF) algorithm for subspace
projection [2], respectively.

2.1 Conventional Beamforming (CBF)

In a sonar array, the determination of the direction of arrival relies
on the detection of the time delay of the signal between sensors. In
CBF, signals sampled across an array are linearly phased (i.e. delayed)
assuming a configuration with uniform distance between elements in the
array. Incoming signals are steered by complex-number vectors called
steering vectors. If the beamformer is properly steered to an incoming
signal, the multichannel input signals will be amplified coherently,
maximizing the beamformer output power. Otherwise, the output of the
beamformer is attenuated to some degree. Thus, peaks in the beamforming
output indicate directions of arrival for sources. The output power of
CBF for each steering angle $\theta$ is defined as

$$P_{CBF}(\theta) = s^{*}(\theta) \, R \, s(\theta) \qquad (1)$$

where $s(\theta)$ is the steering vector, $R$ is the Cross-Spectral
Matrix (CSM), and the operator $*$ indicates complex-conjugate
transposition.
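As an illustration of equation (1), the following minimal Python sketch evaluates the CBF output power over a set of steering angles for a single frequency bin. It is not the implementation used in this study. The half-wavelength sensor spacing, the single-snapshot CSM estimate, the 1-degree angle grid, and all function names are assumptions made for the example.

```python
import cmath
import math

def steering_vector(theta_deg, num_sensors, spacing_wavelengths=0.5):
    """Steering vector of a uniform line array (assumed half-wavelength spacing)."""
    phase = 2.0 * math.pi * spacing_wavelengths * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * phase * n) for n in range(num_sensors)]

def cross_spectral_matrix(snapshot):
    """Single-snapshot CSM estimate R = x x^H for one frequency bin."""
    return [[xi * xj.conjugate() for xj in snapshot] for xi in snapshot]

def cbf_power(csm, s):
    """Equation (1): P_CBF(theta) = s^* R s, which is real for Hermitian R."""
    total = 0j
    for i in range(len(s)):
        for j in range(len(s)):
            total += s[i].conjugate() * csm[i][j] * s[j]
    return total.real

# One noiseless plane wave arriving from 20 degrees on an 8-sensor array.
snapshot = steering_vector(20.0, 8)
R = cross_spectral_matrix(snapshot)
angles = list(range(-90, 91))  # 181 steering angles on an assumed 1-degree grid
spectrum = [cbf_power(R, steering_vector(a, 8)) for a in angles]
peak_angle = angles[spectrum.index(max(spectrum))]
```

In this noiseless case the output power peaks at the steering angle that matches the arrival direction, which is how peaks in the beamforming output indicate directions of arrival.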

2.2 Split-Aperture CBF (SA-CBF)

SA-CBF is based on single-aperture conventional beamforming in the
frequency domain. The beamforming array is logically divided into two
sub-arrays. Each sub-array independently performs CBF using steering
vectors on its own data. The two sub-array beamforming outputs are
cross-correlated to detect the time delay of the signal for each
steering angle. Interpolation is used to generate outputs for angles
other than the steering angles (e.g. a ratio of 4-to-1 is used in this
study). The cross-correlated data, together with knowledge of the
steering angles and several other parameters, are then mapped to the
final beamforming output. Figure 1 shows the block diagram of the SA-CBF
algorithm.
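The sub-array beamforming and cross-correlation stages can be sketched in Python as follows. This is a simplified illustration for one frequency bin and one steering angle, not the SA-CBF implementation of [1]: the interpolation and final mapping stages are omitted, and the half-wavelength spacing, the equal split into two halves, and the function names are assumptions.

```python
import cmath
import math

def subarray_beam(samples, theta_deg, spacing_wavelengths=0.5):
    """Frequency-domain CBF output of one sub-array for one steering angle."""
    phase = 2.0 * math.pi * spacing_wavelengths * math.sin(math.radians(theta_deg))
    return sum(cmath.exp(-1j * phase * n) * x for n, x in enumerate(samples))

def split_aperture_cross_spectrum(samples, theta_deg):
    """Beamform each half of the array separately, then cross-correlate.

    In the frequency domain, cross-correlating the two beams amounts to
    multiplying one beam by the complex conjugate of the other.
    """
    half = len(samples) // 2
    left = subarray_beam(samples[:half], theta_deg)
    right = subarray_beam(samples[half:], theta_deg)
    return left.conjugate() * right

# Plane wave from 20 degrees on an 8-sensor array (two 4-sensor sub-arrays).
source_deg, num_sensors, d = 20.0, 8, 0.5
k = 2.0 * math.pi * d * math.sin(math.radians(source_deg))
samples = [cmath.exp(1j * k * n) for n in range(num_sensors)]

cross = split_aperture_cross_spectrum(samples, theta_deg=20.0)
# The phase of the cross-spectrum (wrapped to (-pi, pi]) encodes the
# signal delay between the two sub-arrays.
measured_phase = cmath.phase(cross)
expected_phase = cmath.phase(cmath.exp(1j * k * (num_sensors // 2)))
```

For a source at the steering angle, each sub-array sums its sensors coherently, and the phase of the cross-spectrum equals the phase shift accumulated across the half-array offset.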


Figure 1: Block diagram of SA-CBF

2.3 Adaptive Beamforming (ABF)

The ABF algorithm used in this study is a subspace-projection beamformer
based on QR decomposition [2]. Subspace beamforming algorithms for ABF
such as MUSIC make use of the property that eigenvectors associated with
noise are orthogonal to the space spanned by the incident signal mode
vectors. The reciprocal of the steered noise subspace indicates peak
points at signal locations. However, subspace identification to separate
noise and signal using the Singular Value Decomposition (SVD) is
computationally expensive to perform and difficult to implement in a
parallel algorithm due to the many dependencies between the
computational tasks. Instead of using the eigenvectors of the CSM, the
columns of the Q matrix that correspond to the noise subspace are used.
In this study, the Q matrix is obtained from the QR decomposition of the
CSM using elementary reflectors. The output of the subspace beamformer
is defined as

$$P_{ABF}(\theta) = \frac{1}{s^{*}(\theta) \, E_N E_N^{*} \, s(\theta)} \qquad (2)$$

where $E_N$ consists of the columns of the Q matrix corresponding to the
noise subspace.

3. Parallel Beamforming Algorithms

In a distributed sonar array for parallel processing in situ, the degree
of parallelism is linked to the number of physical nodes in the system.
However, an increase in the number of nodes increases the problem size.
The goal is to obtain minimum processor stalling through equal
distribution of work and minimum communication overhead.

The method of parallelization employed by the parallel algorithms in
this paper, known as iteration decomposition [3,4], focuses on the
partitioning of beamforming jobs across iterations, with each iteration
processing a different set of array input samples. Successive iterations
are assigned to successive processors in the array and are overlapped in
execution with one another by pipelining. A single node performs the
beamforming task for a given sample set while the other nodes
simultaneously work on their respective beamforming iterations. At the
beginning of every iteration, each node executes an FFT on data that has
been newly collected by its sensor, and the results are communicated to
other processors before the beamforming for that iteration commences.
The block diagram in Figure 2 illustrates the manner in which
beamforming iterations are distributed across the nodes in the
distributed array, in this case using 3 nodes.


Figure 2: Iteration decomposition


Each processor calculates an index based on its node number, the current
job number and the number of nodes. This index tells the node at which
point in its current iteration it must resume after executing the FFT
and communication stages, and when it must pause to begin another
iteration.
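The round-robin assignment and the per-node index can be sketched as follows. The exact index computation used in the implementation is not given here, so the modular formula below is a hypothetical choice. It assumes that one new iteration enters the pipeline per cycle and that each iteration spans as many cycles as there are nodes.

```python
def owner_of_iteration(iteration, num_nodes):
    """Round-robin assignment: iteration i is beamformed by node i mod N."""
    return iteration % num_nodes

def pipeline_stage(node, cycle, num_nodes):
    """Hypothetical index: how far `node` is into its own iteration.

    Stage 0 means the node begins a new beamforming iteration this cycle;
    stages 1..N-1 mean it resumes the iteration it started earlier.
    """
    return (cycle - node) % num_nodes

def schedule(num_nodes, num_cycles):
    """(iteration, stage) per node per cycle; None while the pipeline fills."""
    table = []
    for cycle in range(num_cycles):
        row = []
        for node in range(num_nodes):
            stage = pipeline_stage(node, cycle, num_nodes)
            iteration = cycle - stage
            row.append((iteration, stage) if iteration >= 0 else None)
        table.append(row)
    return table

# Three nodes, as in Figure 2; one row is printed per cycle.
for cycle, row in enumerate(schedule(num_nodes=3, num_cycles=5)):
    print(cycle, row)
```

Under these assumptions, once the pipeline is full one iteration completes every cycle, while each individual result takes a full pass through all stages to emerge.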

The communication pattern that can be expected in the CBF and SA-CBF
algorithms is an ‘all-to-one’ pattern, as only one of the nodes needs to
receive the data sampled each cycle to perform a given beamforming
iteration on data collected throughout the array. However, ABF
algorithms require ‘all-to-all’ communication so that the cross-spectral
matrix on each of the nodes is updated with each cycle of sampling.
Thus, to provide a common framework for comparisons, an ‘all-to-all’
type of communication is used with all the parallel algorithms, where
each node sends its Fourier transformed data to all other nodes.
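To give a sense of the difference between the two patterns, the short Python sketch below counts the messages sent per sampling cycle. It assumes point-to-point messages with no broadcast or multicast support in the network, which is a simplification, and the function name is illustrative.

```python
def messages_per_cycle(num_nodes, pattern):
    """Point-to-point messages needed per sampling cycle (no multicast assumed)."""
    if pattern == "all-to-one":
        # Every node except the receiver sends its data to one node.
        return num_nodes - 1
    if pattern == "all-to-all":
        # Every node sends its data to each of the other nodes.
        return num_nodes * (num_nodes - 1)
    raise ValueError("unknown pattern: " + pattern)

for n in (4, 6, 8):
    print(n, messages_per_cycle(n, "all-to-one"), messages_per_cycle(n, "all-to-all"))
```

Under this assumption the all-to-all message count grows quadratically with the number of nodes, whereas the all-to-one count grows linearly.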

4.  Parallel  Performance  Analysis 

The parallel algorithms in this paper were implemented in MPI-C and
executed on a cluster of SPARCstation-20 workstations connected by a
155 Mb/s ATM network. In the experiments described in this section, a
sampling frequency of 1500 Hz is assumed and the beamforming is
performed for 200 frequency bins. The FFT length of the processed data
is 2048 samples, and no frequency-bin averaging is performed.
Approximately 180 steering angles are resolved (i.e. 181 for CBF and
ABF, and 177 for SA-CBF), with a 4-to-1 ratio of interpolation used with
SA-CBF.
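The stated sampling parameters imply the following quantities, worked out here as simple arithmetic in Python. The constant names are illustrative. Treating the 200 bins as contiguous, and treating one FFT frame as a non-overlapping block of samples, are assumptions not stated in the text.

```python
SAMPLING_RATE_HZ = 1500.0
FFT_LENGTH = 2048
NUM_BINS = 200

# Width of one frequency bin: sampling rate divided by FFT length.
bin_width_hz = SAMPLING_RATE_HZ / FFT_LENGTH

# Time needed to collect the samples for one FFT frame
# (assuming non-overlapping frames).
frame_duration_s = FFT_LENGTH / SAMPLING_RATE_HZ

# Bandwidth covered by the processed bins (assuming they are contiguous).
processed_band_hz = NUM_BINS * bin_width_hz

print(round(bin_width_hz, 3), round(frame_duration_s, 3), round(processed_band_hz, 1))
```

Each bin is therefore roughly 0.73 Hz wide, one frame spans roughly 1.37 s of data, and 200 contiguous bins would cover about 146 Hz.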

It is apparent from the results in Figure 3 that the effective execution
times of the parallel algorithms are much lower than their sequential
counterparts. Moreover, the increase in parallel execution time as the
problem size increases is less pronounced than in the sequential case.
Thus, the parallel algorithms are seen to provide a higher throughput of
execution.

Figure 3: Beamformer execution times


The execution time for parallel SA-CBF is always less than that for both
parallel CBF and parallel ABF. Unlike SA-CBF, the CBF and ABF algorithms
must process every steering angle directly, with no interpolation, and
hence they involve more computation. This characteristic is an inherent
strength of the SA-CBF algorithm. ABF requires a higher execution time
than both CBF and SA-CBF because of the complexity of the QR
decomposition stage.

Figure 4 shows the result latencies for the different algorithms. Result
latency refers to the time required for the final output to be available
after the data has been read by the sensors. In the case of sequential
algorithms, result latency is the same as the execution time, as there
is no pipelining involved.

Figure 4: Beamformer Result Latency

Result latencies with the parallel algorithms are slightly higher than
those of their sequential counterparts. This difference can be
attributed to the fact that each beamforming job has been divided into
pipeline stages and hence involves pipeline management overhead and
communication time between successive stages. Thus, there is an obvious
tradeoff between execution throughput and result latency when using
parallel algorithms based on the technique of iteration decomposition.

Speedup is defined as the ratio of the sequential execution time to the
parallel execution time, where ideal speedup is equal to the number of
processors employed. Scaled speedup recognizes that, in this case, an
increase in the number of processors brings with it an increase in the
problem size, since each node possesses both a processor and a sensor.
As seen in Figure 5, the scaled speedups for the three parallel
algorithms are observed to be near linear. However, for a higher number
of nodes, parallel SA-CBF appears to provide a lower speedup compared to
parallel CBF and parallel ABF. This outcome is a result of the presence
of loops of different sizes, different numbers of steering angles and
different numbers of output angles, leading to a slight imbalance.

Figure 5: Beamformer Scaled Speedup


Parallel efficiency is defined as the ratio of speedup to the number of
processing nodes (i.e. the ideal speedup). As illustrated in Figure 6,
the parallel algorithms achieve levels of parallel efficiency in the
range of 77-91%, with an average of approximately 80% for the largest
cases. Since communication overhead plays a more significant role as the
size of the system increases, the efficiencies decrease slightly with an
increase in the number of nodes. However, the relatively flat nature of
these results demonstrates that scalability is achieved at least for
arrays of moderate size and complexity.
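The two metrics can be computed as in the following Python sketch. The timing values are purely illustrative and are not measurements from these experiments. For scaled speedup, the sequential and parallel times are both assumed to be taken at the problem size that corresponds to the given number of nodes.

```python
def scaled_speedup(sequential_time, parallel_time):
    """Ratio of sequential to parallel execution time at the same problem size."""
    return sequential_time / parallel_time

def parallel_efficiency(speedup, num_nodes):
    """Speedup divided by the ideal speedup, i.e. the number of nodes."""
    return speedup / num_nodes

# Illustrative (made-up) execution times in seconds, keyed by node count.
timings = {4: (4.0, 1.1), 8: (16.0, 2.5)}

results = {}
for nodes, (t_seq, t_par) in timings.items():
    s = scaled_speedup(t_seq, t_par)
    results[nodes] = (s, parallel_efficiency(s, nodes))
    print(nodes, round(s, 2), round(results[nodes][1], 2))
```

With these made-up numbers, eight nodes yield a speedup of 6.4 and an efficiency of 0.8, the same order as the efficiencies reported above.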

5. Conclusions

This paper has presented a comparative analysis of the performance of
several parallel algorithms for in-situ beamforming on a distributed
system. Each of the algorithms is based on the same technique of
pipelined decomposition, where consecutive iterations of the beamforming
process are scheduled in a round-robin fashion to execute on consecutive
processing nodes in the array. These algorithms were implemented as
message-passing parallel programs and executed on a cluster of
workstations connected by ATM, and their performance was measured.

Figure 6: Beamformer Parallel Efficiency


With respect to execution time, the parallel algorithms demonstrate a
consistent relationship regardless of the system size, where
split-aperture CBF performs the fastest, followed by the single-aperture
CBF and lastly the ABF. The sequential algorithms demonstrate this same
trend. However, despite the increased complexity associated with ABF,
the results in these experiments indicate that its execution throughput
is lower than that of the simple CBF by at most a factor of 2.

One of the disadvantages of using a pipelined approach to parallel
processing is an increase in result latency. However, measurements
indicate that the pipelining overhead that increases the latency in
producing results is marginal.

Finally, the scaled speedup and parallel efficiency achieved with each
of the parallel beamformers were found to reach approximately 80% of the
ideal for systems of four, six, and eight nodes. The general trends
indicate that comparable performance can be expected for larger arrays,
since the decrease in efficiency as system size increases is relatively
slow.

The parallel beamforming algorithms compared in this paper present many
opportunities for increased performance, reliability, and flexibility in
a distributed system for sonar signal processing. Undertaking and
coordinating the computations and communications to perform beamforming
in situ is a challenging task, and is becoming more so as the
beamformers themselves continue to become more sophisticated. Some
beamformers, such as Minimum Variance Distortionless Response (MVDR),
exhibit an even larger degree of communication overhead and thus require
a more elaborate scheme to achieve reduction and hiding of communication
latency [5].

Future research activities on the subject of parallel algorithms for
in-situ processing on distributed arrays will continue to focus on
adaptive techniques in the near term. However, new studies and
developments are underway to help address the tremendous challenges in
computation and communication associated with advanced beamforming in
the littoral environment using matched-field processing.